Building a Linguistically Interpreted Corpus of Bulgarian: the BulTreeBank
Abstract
In the field of Human Language Technology (HLT), the existence of linguistically interpreted real-world texts provides the license a given language needs to enter the area of high-tech applications. The significance of BulTreeBank lies in granting such an HLT license to a “less processed” language like Bulgarian, which until recently had been formally modelled and processed mainly at the morphological level. The BulTreeBank project aims to create syntactically annotated data for Bulgarian, together with the tools for their production, management and automatic processing. It not only provides language resources but also develops an infrastructure of research solutions, production scenarios and services.
Similar resources
Segmentation Layers in the Group of the Predicate: a Case Study of Bulgarian within the BulTreeBank Framework
This paper describes the development of a regular grammar that automatically recognizes and delimits segments in the group of the predicate in Bulgarian sentences. The language-specific segmentation is performed at the level of partial parsing, where reliable, meaningful and useful entities called chunks are formed. The significance of the grammar development lies in the fact that it is a plu...
Challenges Behind the Data-driven Bulgarian WordNet (BulTreeBank Bulgarian Wordnet)
The paper presents our work towards the simultaneous creation of a data-driven WordNet for Bulgarian and a manually annotated treebank with semantic information. Such an approach requires synchronization of the word senses in both syntactic and lexical resources, without limiting the WordNet senses to the corpus or vice versa. Our strategy focuses on the identification of senses used in BulTree...
A Data-Driven Dependency Parser for Bulgarian
One of the main motivations for building treebanks is that they facilitate the development of syntactic parsers, by providing realistic data for evaluation as well as inductive learning. In this paper we present what we believe to be the first robust data-driven parser for Bulgarian, trained and evaluated on data from BulTreeBank (Simov et al., 2002). The parser uses dependency-based representa...
A Treebank-driven Creation of an OntoValence Verb lexicon for Bulgarian
The paper presents a treebank-driven approach to the construction of a Bulgarian valence lexicon with ontological restrictions over the inner participants of the event. First, the underlying ideas behind the Bulgarian Ontology-based lexicon are outlined. Then, the extraction and manipulation of the valence frames are discussed with respect to the BulTreeBank annotation scheme and DOLCE ontology....
A Publicly Available Cross-Platform Lemmatizer for Bulgarian
Our dictionary-based lemmatizer for the Bulgarian language, presented here, is distributed as free software, publicly available to download and use under the GPL v3 license. The software is written entirely in Java and is distributed as a GATE plugin. To the best of our knowledge, at the time of writing this article, there are no other free lemmatization tools specifically targeting the B...